
This article is reprinted from the October 1996 issue of Inside Solaris, a monthly publication of The Cobb Group.

Writing a script to monitor your system

By Marco C. Mason

No matter what precautions you take, occasions always pop up when you need to drop whatever you're doing and attend to some urgent task. You just can't avoid it. As long as people use computers, computers will run out of resources. As you know, once a computer runs out of resources, such as disk or swap space, recovery can be difficult. If the system crashes, you may spend hours getting it running correctly again.

If you could keep a close eye on your system, you could find out when a catastrophe is imminent and take steps to avert it. Any system administrator would gladly spend a few minutes to prevent a multihour recovery operation. In this article, we'll show you how to build some tools that help you monitor your system and prevent major breakdowns.

What do you want to monitor?

The resources you should monitor vary depending on the applications you run and the physical parameters of the system. Some of the situations you may want a warning about include:

Low disk space
Low swap space
Too many processes
Persistent warning messages
Very high CPU usage over extended periods
A high rate of network errors
Heavy disk usage

The monitoring script we create in this article will monitor disk space on the / and /tmp file systems, as well as available swap space and the number of processes running. Using the basic template we put together here, you can monitor as many things as you like.

Monitoring free space on a file system

One common cause of program failure is a file system running out of space. The amount of free space on a file system normally decreases as you accumulate data. You can use the df command to see how much disk space is free on all your file systems, like this:

# df -b /
Filesystem              avail
/dev/dsk/c0t0d0s0      354283

Because df -b reports free space in kilobytes, this output shows that the / file system has about 350 megabytes free. You can examine the output of the df command to decide which file systems are too full for comfort.

For the purposes of the script that we'll put together later, we want only the number in the second column of the second line. To extract it, we pipe the results of df -b to awk, telling it that we want only the second field on the second line, like this:

$ df -b / | awk 'NR==2 { print $2 }'
354283

We can place this value in a variable by enclosing the expression in grave accents (`) and treating it just like a value in an assignment statement. The completed statement that puts the free space of the / file system into the TEMP variable is

TEMP=`df -b / | awk 'NR==2 { print $2 }'`
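The extraction step can be tried without a Solaris system by feeding awk the sample output shown earlier. The following is only a sketch: the function names sample_df_output and second_field_of_line_two are hypothetical helpers introduced for illustration, and the sample text is copied from the df -b example above.

```shell
# Hypothetical stand-in for "df -b /": prints the sample output from the
# article so the sketch runs on any system.
sample_df_output() {
    cat <<'EOF'
Filesystem              avail
/dev/dsk/c0t0d0s0      354283
EOF
}

# Hypothetical helper: keep only the second field of the second line.
second_field_of_line_two() {
    awk 'NR==2 { print $2 }'
}

# Grave accents capture the pipeline's output in TEMP.
TEMP=`sample_df_output | second_field_of_line_two`
echo "${TEMP} KB free on /"
```

On a live system, you would replace sample_df_output with the real df -b / command.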

Monitoring free swap space

Another major catastrophe occurs when the system runs out of swap space. In this case, Solaris must start killing jobs to free swap space. Since Solaris doesn't know which jobs are the most important, it may easily kill your mission-critical jobs. If you've installed Solaris in the normal way, the swap area shares a disk slice with the /tmp file system. In this case, you don't necessarily need any special code to check for low swap space. Instead, you can simply use the check for low free space on the /tmp file system.

On the other hand, if you've separated the swap space from the /tmp file system, you need a different method of finding out how much free swap space you have. In this case, you can use the swap -l command to list the swap areas, like this:

$ swap -l
swapfile	dev	swaplo	blocks	free
/dev/dsk/c0t0d0s1	102,1	8 	131752	110784
/extra_swap	-	8    	992    	992
/extra_swap_2	- 	8   	1408   	1408

As you can see here, this system has three swap areas, with a total of 113,184 blocks free (nearly 60MB). If you're going to write a script to monitor your swap space, all you need to do is add the amount of free space for all swapping partitions and compare the result to a threshold value to see if you're running dangerously low.

You can pipe the output of swap -l to a simple awk script to compute the total free space. The awk script must simply add together all the values in the fifth column for all lines after the first. At the command line, type the following command to get the amount of free swap space:

$ swap -l | awk 'BEGIN {ttl=0} NR>1 {ttl+=$5} END {print ttl}'
113184

As you'd expect, we can place the amount of free swap space in a shell variable by enclosing the preceding expression in grave accents and making the assignment, like this:

TEMP=`swap -l | awk 'BEGIN {ttl=0} NR>1 {ttl+=$5} END {print ttl}'`
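To check the arithmetic, the summing step can be run against the swap -l listing shown above. This is a minimal sketch: sample_swap_output and sum_free_blocks are hypothetical names, and the conversion to megabytes assumes the 512-byte blocks that swap -l reports.

```shell
# Hypothetical stand-in for "swap -l", using the figures from the article.
sample_swap_output() {
    cat <<'EOF'
swapfile             dev    swaplo blocks   free
/dev/dsk/c0t0d0s1    102,1  8      131752   110784
/extra_swap          -      8      992      992
/extra_swap_2        -      8      1408     1408
EOF
}

# Hypothetical helper: add up the fifth column, skipping the header line.
sum_free_blocks() {
    awk 'NR>1 { ttl += $5 } END { print ttl + 0 }'
}

FREE_BLOCKS=`sample_swap_output | sum_free_blocks`

# Assumes 512-byte blocks; expr performs integer division, so the
# result is rounded down to whole megabytes (of 1,048,576 bytes each).
FREE_MB=`expr ${FREE_BLOCKS} \* 512 / 1048576`

echo "${FREE_BLOCKS} blocks of swap free (about ${FREE_MB} MB)"
```

The three free values add up to 113184 blocks, matching the total quoted above.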

How many processes are running?

Perhaps your system has a problem when too many processes are executing at once. If so, you may want to monitor the number of processes executing at any given time. Counting the number of active processes on the system is easy. We use the ps -A command to report all processes, one per line. Then we use wc -l to count the number of lines, as follows:

# ps -A | wc -l
      44

So, to put the number of processes in a shell variable, we can use this command:

TEMP=`ps -A | wc -l`
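One detail worth noting is that ps -A prints a heading line, so wc -l counts one more line than there are processes. If you want an exact count, a small awk filter can skip the heading. The sketch below assumes that behavior; count_processes and sample_ps_output are hypothetical names, and the process list is made-up sample data, not output captured from a real system.

```shell
# Made-up sample in the general shape of "ps -A" output (illustration only).
sample_ps_output() {
    cat <<'EOF'
   PID TTY      TIME CMD
     0 ?        0:01 sched
     1 ?        0:02 init
     2 ?        0:00 pageout
EOF
}

# Hypothetical helper: count every line after the heading.
# "n + 0" makes awk print 0 rather than an empty string when no
# process lines are present.
count_processes() {
    awk 'NR>1 { n++ } END { print n + 0 }'
}

TEMP=`sample_ps_output | count_processes`
echo "${TEMP} processes currently running"
```

On a live system, you would replace sample_ps_output with ps -A. For a coarse threshold such as MAX_PROCS, the off-by-one from the heading rarely matters, so the simpler wc -l form is also reasonable.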

Checking your system state with a shell script

You can check for many other things, but this is a good start for our system-monitoring script. Once we've obtained the information we want, we use basically the same structure for every check to determine whether the system is in trouble. An if statement tests whether we've violated the limit. If we have, we append a warning message to a report file and set the STATUS variable to 1, as shown in Listing A. The cat command in the listing uses a here document, as described in the article "Automating Applications that Accept User Input" in the June issue, to add a failure warning record to the file specified by REPORT.

Finally, after the script checks all parameters, it decides whether to send E-mail and page the system administrator. It then deletes the temporary file it used to build the mail message. (We inadvertently tested an early version of the ISOL_Monitor script: we forgot to delete the temporary file, and eventually the script told us that the /tmp file system was too full!)

if [ ${STATUS} -gt 0 ]; then
   mail ${SYSADMIN} <${REPORT}
   cu pgr_${SYSADMIN} >/dev/null
fi
rm ${REPORT}

Please note that for our purposes, we're assuming you created a paging system named pgr_SysAdmin, where SysAdmin is the username of your system administrator.

Listing A

if [ ${TEMP} -lt ${MIN_ROOT_SPC} ]; then
   echo "   Not enough!"
   cat <<- XYZZY >>${REPORT}
	Insufficient space on /
	   (${TEMP} < ${MIN_ROOT_SPC})

	XYZZY
   STATUS=1
fi

We use this ISOL_Monitor structure throughout to warn the user about potential problems.
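Because every check follows the same pattern, the if statement and here document could also be wrapped in a shell function. The following is a sketch of that idea rather than part of the ISOL_Monitor script: the check_minimum function, its argument order, the report file name, and the REPORT_TEXT variable are all assumptions made for illustration, and the values passed in are the sample figures from earlier in the article.

```shell
STATUS=0
REPORT=/tmp/monitor_sketch_$$      # hypothetical report file name
: >"${REPORT}"                     # start with an empty report

# Hypothetical helper: check_minimum LABEL VALUE MINIMUM
# Appends a warning to REPORT and sets STATUS when VALUE < MINIMUM.
check_minimum() {
    if [ "$2" -lt "$3" ]; then
        cat <<XYZZY >>"${REPORT}"
Insufficient $1
   ($2 < $3)

XYZZY
        STATUS=1
    fi
}

check_minimum "space on /" 354283 1000000    # below the limit: warns
check_minimum "swap space" 113184 100000     # above the limit: silent

REPORT_TEXT=`cat "${REPORT}"`
echo "${REPORT_TEXT}"
rm -f "${REPORT}"
```

A matching check_maximum function using -gt would cover the process-count test. Whether this is clearer than repeating the if block is a matter of taste; the repeated form keeps each check readable on its own.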

Listing B shows the entire ISOL_Monitor script we created to monitor the system and evaluate the results. The configuration section at the beginning sets the limits that trigger a warning when they are violated.

As you can see, we set deliberately unreasonable limits so that you can watch the script send you E-mail and page you. Also note that you need to change the SYSADMIN variable to your username. Once you install the script on your system, just tune these parameters to values that suit your needs.

Listing B

#! /usr/bin/ksh
#------------------------------------
# Monitor system statistics, and warn
# sysadmin(s) of any impending probs.
#------------------------------------

# CONFIGURATION
MIN_ROOT_SPC=1000000
MIN_TEMP_SPC=2000000
MIN_SWAP_SPC=1000000
MAX_PROCS=3
SYSADMIN=marco
PATH=/usr/sbin:/usr/bin

# By default, we're not going to send a
# page, or any E-Mail
STATUS=0
REPORT=/tmp/ISOL_Monitor_${$}
rm -f ${REPORT}

# Is there enough space on /?
TEMP=`df -b / | awk 'NR==2 { print $2 }'`
echo ${TEMP} "KB left on /"
if [ ${TEMP} -lt ${MIN_ROOT_SPC} ]; then
   echo "   Not enough!"
   cat <<- XYZZY >>${REPORT}
	Insufficient space on /
	   (${TEMP} < ${MIN_ROOT_SPC})

	XYZZY
   STATUS=1
fi

# Is there enough space on /tmp?
TEMP=`df -b /tmp | awk 'NR==2 { print $2 }'`
echo ${TEMP} "KB left on /tmp"
if [ ${TEMP} -lt ${MIN_TEMP_SPC} ]; then
   echo "   Not enough!"
   cat <<- XYZZY >>${REPORT}
	Insufficient space on /tmp
	   (${TEMP} < ${MIN_TEMP_SPC})

	XYZZY
   STATUS=1
fi

# Is there enough swap space?
TEMP=`swap -l | awk 'BEGIN { total=0 } NR>=2 { total += $5 } END { print total }'`
echo ${TEMP} "blocks of swap space left"
if [ ${TEMP} -lt ${MIN_SWAP_SPC} ]; then
   echo "   Not enough!"
   cat <<- XYZZY >>${REPORT}
	Insufficient swap space
	   (${TEMP} < ${MIN_SWAP_SPC})

	XYZZY
   STATUS=1

fi

# Are there too many processes running?
TEMP=`ps -A | wc -l`
echo ${TEMP} "processes currently running"
if [ ${TEMP} -gt ${MAX_PROCS} ]; then
   echo "   Too many!"
   cat <<- XYZZY >>${REPORT}
	Too many processes!
	   (${TEMP} > ${MAX_PROCS})

	XYZZY
   STATUS=1
fi

# If we've detected any bad problems,
# E-Mail the report to the sysadmin
# and then issue a page
if [ ${STATUS} -gt 0 ]; then
   mail ${SYSADMIN} <${REPORT}
   cu pgr_${SYSADMIN} >/dev/null
fi
rm -f ${REPORT}

The ISOL_Monitor script monitors your system and alerts you when a resource is critically low.

Miscellaneous

The article "Configuring Your Computer to Page You," on page 1, shows how to set up your computer so it can page you when the ISOL_Monitor script detects a problem. You will probably want to execute this script frequently. For information on setting the script to run automatically, check out the article "Scheduling a Job for Periodic Execution" on page 5. Finally, we rely heavily on awk to make this script work, so you'll want to refer to the man page on awk or read the article "An Introduction to awk" in our May issue.

Conclusion

The ISOL_Monitor script we presented here is only a starting point on which you can build a more sophisticated monitoring system. There are many ways you can improve it. Here are some suggestions:

Monitor the system for other potential problems, such as missing processes.
Use cron to schedule the script for frequent execution.
For sites with multiple shifts, set the SYSADMIN variable based on the time of day.
For sites with external access, monitor hacking attempts.
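As an illustration of the multiple-shift suggestion, the script could choose SYSADMIN from the current hour. This is only a sketch under stated assumptions: the pick_sysadmin function, the shift boundaries, and the usernames evening_admin and night_admin are hypothetical; marco is the username from Listing B.

```shell
# Hypothetical helper: pick_sysadmin HOUR (00-23) prints the username
# of the administrator on duty. Shift boundaries are assumptions.
pick_sysadmin() {
    # expr treats its operand as decimal, so "08" becomes 8
    hour=`expr "$1" + 0`
    if [ "${hour}" -ge 8 ] && [ "${hour}" -lt 16 ]; then
        echo marco              # day shift: 08:00 to 15:59
    elif [ "${hour}" -ge 16 ]; then
        echo evening_admin      # evening shift: 16:00 to 23:59
    else
        echo night_admin        # night shift: 00:00 to 07:59
    fi
}

# date +%H prints the current hour as two digits
SYSADMIN=`pick_sysadmin \`date +%H\``
echo "Warnings will go to ${SYSADMIN}"
```

In ISOL_Monitor, an assignment like this would replace the fixed SYSADMIN=marco line in the configuration section.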

Marco C. Mason is a freelance computer consultant and author based in Louisville, Kentucky.


Copyright (c) 1996 The Cobb Group, a division of Ziff-Davis Publishing Company. All rights reserved. Reproduction in whole or in part in any form or medium without express written permission of Ziff-Davis Publishing Company is prohibited. The Cobb Group and The Cobb Group logo are trademarks of Ziff-Davis Publishing Company.
